# Instruction Pipelining: Hazard Resolution, Timing Constraints

Mengjia Yan
Computer Science and Artificial Intelligence Laboratory
M.I.T.

#### Pipeline Diagram - Ideal Pipelining

Over long term, i.e., in steady state, what are the ...

CPI? 1 IPC? 1 Inst exec time? 5

Num inst in flight? 
$$\bar{T} = \frac{\bar{N}}{\bar{L}}$$
 -> 5

### Reminder: Pipelined RISC-V Datapath without jumps



Pipelining increases clock frequency, but instruction dependences may increase CPI

# How instructions can interact with each other in a pipeline

- An instruction in the pipeline may need a resource being used by another instruction in the pipeline

   > structural hazard
- An instruction may <u>depend</u> on a value produced by an earlier instruction
  - Dependence may be for a data calculation
    - → data hazard
  - Dependence may be for calculating the next PC

→ control hazard (branches, interrupts)

#### Data Hazards



September 20, 2023 MIT 6.5900 Fall 2023 L05-5

a1 is stale. Oops!

### Pipeline Diagram – Hazard

#### time $(I_2) a3 \leftarrow a1 + 12$





MIT 6.5900 Fall 2023 September 20, 2023 L05-6

#### Resolving Data Hazards

Strategy 1: Wait for the result to be available by freezing earlier pipeline stages → stall

Strategy 2: Route data as soon as possible after it is calculated to the earlier pipeline stage > bypass

Strategy 3: Speculate on the dependence

Two cases:

Guessed correctly → no special action required Guessed incorrectly → kill and restart

#### Resolving Data Hazards

Strategy 1: Wait for the result to be available by freezing earlier pipeline stages → stall

Stall DEC & IF when instruction in DEC reads a register that is written by any earlier in-flight instruction (in EXE, MEM, or WB)

### Reminder: Stall Control Logic ignoring jumps & branches



Compare the source registers of the instruction in the decode stage with the destination register of the uncommitted instructions.

### Reminder: Stall Control Logic

ignoring jumps & branches



Should we always stall if the rs field(s) matches some rd?

not every instruction writes a register  $\Rightarrow$  we (write enable) not every instruction reads a register  $\Rightarrow$  re (read enable)

### Source & Destination Registers

| R-type: | funct7    | rs2   | rs1 | funct3 | rd       | opcode |
|---------|-----------|-------|-----|--------|----------|--------|
| I-type: | imm[11:0] |       | rs1 | funct3 | rd       | opcode |
|         | imm[11:5  | ] rs2 | rs1 | func3  | imm[4:0] | opcode |

|             |                                                   | source(s) | destination |
|-------------|---------------------------------------------------|-----------|-------------|
| ALU         | $rd \leftarrow (rs1) func (rs2)$                  | rs1, rs2  | rd          |
| ALUi        | $rd \leftarrow (rs1) \text{ funct SXT(imm)}$      | rs1       | rd          |
| LW          | $rd \leftarrow M [(rs1) + imm.disp]$              | rs1       | rd          |
| SW          | $M[(rs1) + imm.disp] \leftarrow (rs2)$            | rs1, rs2  |             |
| BEQ         | cond(rs1) = = (rs2)                               |           |             |
|             | <i>true</i> : $PC \leftarrow (PC) + SXT(imm)$     | rs1, rs2  |             |
|             | false: $PC \leftarrow (PC) + 4$                   | rs1, rs2  |             |
| JAL         | $rd \leftarrow (PC+4), PC \leftarrow (PC) + imm$  |           | rd          |
| <b>JALR</b> | $rd \leftarrow (PC+4), PC \leftarrow (rs1) + imm$ | * rs1     | rd          |

\*More precise:  $pc \leftarrow \{(reg[rs1] + imm)[31:1], 1'b0\}$ 

#### Deriving the Stall Signal

```
C_{dest}

we = Case opcode

ALU, ALUi, LW, JAL, JALR

\Rightarrow (rd \neq 0)

\Rightarrow off
```

```
 \begin{array}{c} C_{re} \\ re1 = \textit{Case} \text{ opcode} \\ \text{ALU, ALUi,} \\ \text{LW, SW, BEQ,} \\ \text{JR, JALR} \\ \Rightarrow \text{on} \\ \\ re2 = \textit{Case} \text{ opcode} \\ \text{ALU, SW, BEQ} \\ \Rightarrow \text{off} \\ \\ \end{array}
```

```
C_{stall} stall = ((rs1_D == rd_E) \cdot we_E + (rs1_D == rd_M) \cdot we_M + (rs1_D == rd_W) \cdot we_W) \cdot re1_D + ((rs2_D == rd_E) \cdot we_E + (rs2_D == rd_M) \cdot we_M + (rs2_D == rd_W) \cdot we_W) \cdot re2_D
```

This is not story!

#### Hazards due to Loads & Stores



#### Load & Store Hazards

$$(a1)+7 = (a3)+5 \Rightarrow data\ hazard$$

However, the hazard is avoided because our memory system completes writes in one cycle!

Load/Store hazards are sometimes resolved in the pipeline and sometimes in the memory system itself.

More on this later in the course.

### Resolving Data Hazards (2)

Strategy 2:

Route data as soon as possible after it is calculated to the earlier pipeline stage → bypass

#### Bypassing

Each stall or kill introduces a bubble  $\Rightarrow CPI > 1$ When is data actually available? At Execute

A new datapath, i.e., a bypass, can get the data from the output of the ALU to its input

### Adding a Bypass



When does this bypass help?

$$(I_1)$$
  $a1 \leftarrow a0 + 10$   
 $(I_2)$   $a3 \leftarrow a1 + 12$   
September 20, 2023

$$a1 \leftarrow M[a0 + 10]$$
  
 $a3 \leftarrow a1 + 12$   
MIT 6.5900 Fall 2023



#### The Bypass Signal

Deriving it from the Stall Signal

```
stall = \frac{((rs1_D = rd_E) we_E + (rs1_D = rd_M) \cdot we_M + (rs1_D = rd_W) \cdot we_W) \cdot re1_D}{+((rs2_D = rd_E) \cdot we_E + (rs2_D = rd_M) \cdot we_M + (rs2_D = rd_W) \cdot we_W) \cdot re2_D}
```

```
we = Case opcode
ALU, ALUi, LW, JAL, JALR
\Rightarrow (rd \neq 0)
... \Rightarrow off
```

$$ASrc = (rs1_D = = rd_E) \cdot we_E \cdot re1_D$$

Is this correct?

No, only ALU and ALUi instructions can benefit from this bypass. Memory and JAL\* instructions cannot.

How might we address this?

Split we<sub>F</sub> into two components: we-bypass, we-stall

#### Bypass and Stall Signals

#### Split we<sub>E</sub> into two components: we-bypass, we-stall

```
we-bypass<sub>E</sub> = Case opcode<sub>E</sub>
ALU, ALUi \Rightarrow (rd \neq 0)
... \Rightarrow off
```

```
 \begin{array}{ll} \text{we-stall}_{\text{E}} = \textit{Case} \ \text{opcode}_{\text{E}} \\ \text{LW, JAL, JALR} \ \Rightarrow (\text{rd} \neq 0) \\ \dots & \Rightarrow \text{off} \end{array}
```

```
ASrc = (rs1_D = = rd_E) \cdot re1_D \cdot we-bypass_E
```

```
stall = ((rs1_D == rd_E) \cdot we-stall_E + \\ (rs1_D == rd_M) \cdot we_M + (rs1_D == rd_W) \cdot we_W) \cdot re1_D \\ + ((rs2_D == rd_E) \cdot we_E + (rs2_D == rd_M) \cdot we_M + (rs2_D == rd_W) \cdot we_W) \cdot re2_D
```

### Full Bypass Datapath



Is there still a need for the stall signal?

stall = 
$$(rs1_D = = rd_E) \cdot (opcode_E = = LW_E) \cdot (rd_E \neq 0) \cdot re1_D + (rs2_D = = rd_E) \cdot (opcode_E = = LW_E) \cdot (rd_E \neq 0) \cdot re2_D$$

### Full Bypass Datapath



Challenge: How to scale bypass datapaths for more pipeline stages? More complex ISAs?

### Resolving Data Hazards (3)

Strategy 3:

Speculate on the dependence. Two cases:

Guessed correctly → no special action required

Guessed incorrectly → kill and restart

#### Instruction to Instruction Dependence

- What do we need to calculate next PC?
  - For Jumps (JAL)
    - Opcode, offset, and PC
  - For Jump Register (JALR)
    - Opcode, offset, register value
  - For Conditional Branches (e.g., BEQ)
    - Opcode, offset, PC, and register (for condition)
  - For all others (e.g., arithmetic insts)
    - Opcode and PC
- In what stage do we know these?
  - PC → Fetch
  - Opcode, offset → Decode (or Fetch?)
  - Register value → Decode
  - Branch condition ((rs1)== (rs2)) → Execute (or Decode?)

#### NextPC Calculation Bubbles

```
time
                                               t3 t4 t5 t6 t7
                           t0
                                  t1
                                        t2
(I_1) a1 \leftarrow (a0) + 10 IF_1 ID_1 EX_1
                                               MA<sub>1</sub> WB<sub>1</sub>
                            IF<sub>2</sub> IF<sub>2</sub>
                                                ID<sub>2</sub> EX<sub>2</sub> MA<sub>2</sub> WB<sub>2</sub>
(I_2) a3 \leftarrow (a2) + 17
                                                             ID<sub>3</sub> EX<sub>3</sub> MA<sub>3</sub> WB<sub>3</sub>
                                                IF_3
                                                      IF_3
(I_3)
                                                                          ID<sub>4</sub> EX<sub>4</sub> MA<sub>4</sub> WB<sub>4</sub>
(I_4)
                          time
                          t0
                                               t3
                                                             t5
                                                                   t6
                                                                        t7 ....
                                        t2
                                                   t4
                    ΙF
                                 nop
                                        I_2
                                               nop
                                                      I_3
                                                             nop
 Resource
                    ID
                                               I_2
                                        nop
                                                      nop
                                                            I_3
                                                                    nop
 Usage
                    FX
                                               nop I_2 nop I_3
                                                                           nop I₄
 Diagram
                    MA
                                                      nop
                                                            I_2
                                                                    nop I_3 nop
                    WB
                                                                   I_2
                                                                           nop I<sub>3</sub>
                                                             nop
                                                                                        nop I_4
                                                               nop ⇒ pipeline bubble
```

What's a good guess for next PC? PC+4

#### Speculate NextPC is PC+4



#### Speculate NextPC is PC+4





What happens on mis-speculation, i.e., when next instruction is not PC+4?

How?

### Pipelining Jumps



### Pipelining Jumps





 $\begin{array}{ccc} IRSrc_D = \textit{Case} & opcode_D \\ JAL, JALR & \Rightarrow nop \\ ... & \Rightarrow IM \end{array}$ 

#### Jump Pipeline Diagrams

```
time
                                         t2 t3 t4 t5 t6 t7 ....
                           t0
                                  t1
                           \mathsf{IF}_1
                                  ID<sub>1</sub> EX<sub>1</sub> MA<sub>1</sub> WB<sub>1</sub>
(I_1) 096: ADD
(I<sub>2</sub>) 100: JAL ra 200
                                         ID<sub>2</sub>, EX<sub>2</sub> MA<sub>2</sub> WB<sub>2</sub>
                                  IF_2
(I_3) 104: ADD
                                         IF<sub>3</sub> nop nop nop
(I_4) 304: ADD
                                                IF<sub>4</sub> ID<sub>4</sub> EX<sub>4</sub> MA<sub>4</sub> WB<sub>4</sub>
                           time
                           t0
                                                             t5
                                                                     t6 t7
                                         t2
                                                t3
                                                       t4
                                  t1
                           I_1
                    ΙF
                                  I_2
                                         I_3
                                             {
m I}_4
                                                       I_5
 Resource
                    ID
                                                      I_4
                                         I_2
                                                              I_5
                                             nop
 Usage
                    EX
                                                I_2
                                                       nop
 Diagram
                    MA
                                                       I_2 nop I_4
                    WB
                                                              I_2
                                                                     nop I_4
```

September 20, 2023 MIT 6.5900 Fall 2023 L05-29

nop ⇒ pipeline bubble

#### Pipelining Conditional Branches





104 ADD

**ADD** 

304

ADD the execute stage. What action BEQ a1 a2 200 should be taken in the decode stage?

Continue speculating. Decode I2 and fetch PC+4

#### Pipelining Conditional Branches



| $I_1$          | 096 | ADD           |
|----------------|-----|---------------|
| $I_2$          | 100 | BEQ a1 a2 200 |
| $I_3$          | 104 | ADD           |
| T <sub>4</sub> | 304 | ADD           |

If the branch is taken

- kill the two following instructions
- the instruction at the decode stage is not valid

 $\Rightarrow$  stall signal is not valid

#### Pipelining Conditional Branches



| $I_1$ | 096 | ADD           |
|-------|-----|---------------|
| $I_2$ | 100 | BEQ a1 a2 200 |
| $I_3$ | 104 | ADD           |
| $I_4$ | 304 | ADD           |

Don't stall if the branch is taken. Why?

Instruction at the decode stage is invalid

September 20, 2023 MIT 6.5900 Fall 2023

# Control Equations for PC and IR Muxes

```
\begin{array}{ccc} IRSrc_D = \textit{Case} \ opcode_E \\ BEQ \cdot z & \Rightarrow \mathsf{nop} \\ \dots & \Rightarrow \textit{Case} \ opcode_D \\ & & JAL, \ JALR \Rightarrow \mathsf{nop} \\ \dots & \Rightarrow IM \end{array}
```

Give priority to the older instruction, i.e., execute stage instruction over decode stage instruction

```
\begin{split} IRSrc_E &= \textit{Case} \text{ opcode}_E \\ BEQ \cdot z &\Rightarrow \underset{}{\text{nop}} \\ \dots &\Rightarrow stall \cdot \text{nop} + !stall \cdot IR_D \end{split}
```

pc+4 is a
speculative guess

```
nop ⇒ Kill
pc+imm /rs1+imm⇒ Restart
pc+4 ⇒ Speculate
```

### Branch Pipeline Diagrams (resolved in execute stage)

```
time
                          t1 t2 t3 t4 t5 t6 t7 ....
                     t0
(I_1) 096: ADD IF_1 ID_1 EX_1 MA_1 WB_1
(I_2) 100: BEQ a1 a2 200 IF_2 ID_2 EX_2 MA_2 WB_2
(I_3) 104: ADD
                               IF<sub>3</sub> ID<sub>3</sub> hop nop nop
(I_4) 108:
                                          hop nop nop nop
(I_5) 304: ADD
                                          IF<sub>5</sub> ID<sub>5</sub> EX<sub>5</sub> MA<sub>5</sub> WB<sub>5</sub>
                     time
                     t0
                          t1
                             t2 t3 t4 t5 t6 t7 ....
                          I_2 I_3 I_4
               ΙF
                                          I_5
               ID
                               I_2 I_3 nop I_5
Resource
               EX
                                     I_2 nop nop I_5
Usage
               MA
                                          I_2 nop nop I_5
Diagram
               WB
                                               I_2 nop nop I_5
```

nop ⇒ pipeline bubble

# Reducing Branch Penalty (resolve in decode stage)

 One pipeline bubble can be removed if an extra comparator is used in the Decode stage



Pipeline diagram now same as for jumps

## Branch Delay Slots (expose control hazard to software)

- Change the ISA semantics so that the instruction that follows a jump or branch is always executed
  - gives compiler the flexibility to put in a useful instruction where normally a pipeline bubble would have resulted.

| $I_1$ | 096 | ADD           |                          |
|-------|-----|---------------|--------------------------|
| $I_2$ | 100 | BEQ a1 a2 200 | Delay slot instruction   |
| $I_3$ | 104 | ADD ←         | - executed regardless of |
| $I_4$ | 304 | ADD           | branch outcome           |

 Other techniques include branch prediction, which can dramatically reduce the branch penalty... to come later

# Handling Control Hazards due to Exceptions



- Instructions may suffer exceptions in different pipeline stages
- Must prioritize exceptions from earlier instructions



- Typical strategy: Record exceptions, process the first one to reach commit point (i.e., the point where architectural state is modified)
  - Pros/cons vs handling exceptions eagerly, like branches?

# Why an instruction may not be dispatched every cycle (CPI>1)

- Full bypassing may be too expensive to implement
  - Typically, all frequently used paths are provided
  - Some infrequently used bypass paths may increase cycle time and counteract the benefit of reducing CPI
- Loads have two-cycle latency
  - Instruction after load cannot use load result
  - MIPS-I ISA defined *load delay slots*, a software-visible pipeline hazard (compiler schedules independent instruction or inserts NOP to avoid hazard). Removed in MIPS-II.
- Conditional branches, jumps, and exceptions may cause bubbles
  - Kill instruction(s) following branch if no delay slots

Machines with software-visible delay slots may execute significant number of NOP instructions inserted by the compiler.

### Next lecture: Superscalar & Scoreboarded Pipelines